EDA

Reading the file with all the customer churn details into a data frame, then displaying the structure of the data frame.

churn_data<-data.frame(read.csv("Bank_Customer_Churn_Prediction.csv",header=TRUE)) #reading the file into data frame
str(churn_data) #printing the data frame structure
## 'data.frame':    10000 obs. of  12 variables:
##  $ customer_id     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ credit_score    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ country         : chr  "France" "Spain" "France" "France" ...
##  $ gender          : chr  "Female" "Female" "Female" "Female" ...
##  $ age             : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ tenure          : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ balance         : num  0 83808 159661 0 125511 ...
##  $ products_number : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ credit_card     : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ active_member   : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ estimated_salary: num  101349 112543 113932 93827 79084 ...
##  $ churn           : int  1 0 1 0 0 1 0 1 0 0 ...

The churn data frame has 10000 customers, each with the following 12 attributes:

Customer ID : The unique ID of each individual customer

Credit Score : A number depicting the customer’s creditworthiness

Country : The country the customer banks from

Gender : The gender the customer identifies with

Age : The customer's age

Tenure : The number of years the customer has been with the bank

Balance : The amount currently available in the customer’s account

Products Number : The number of products purchased by the customer through the bank

Credit Card : Indicates whether the customer has a credit card

Active Member : Indicates whether the customer is an active or inactive member

Estimated Salary : The bank's estimate of the customer's income

Churn : Indicates whether the customer has left the bank

Checking for duplicates in the data frame.

nrow(unique(churn_data)) #checking for unique rows

There are 10000 unique rows in the data, hence the data frame has no duplicate records. We can proceed with the next step of the analysis.
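An equivalent way to run this check is base R's duplicated()/anyDuplicated(); a minimal sketch on a hypothetical toy frame (not the bank data):

```r
# Toy data frame standing in for churn_data; row 3 duplicates row 2
toy <- data.frame(id = c(1, 2, 2), value = c("a", "b", "b"))
anyDuplicated(toy)   # index of the first duplicated row, 0 if none
sum(duplicated(toy)) # total number of duplicate rows
```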

Checking if there are any NA values in the data frame

sum(is.na(churn_data)) # counting NA values in the data frame

We found that our dataset was clean without any NA values.
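A per-column breakdown can localize any missing values; a sketch on a toy frame with planted NAs (churn_data would be used the same way):

```r
# Hypothetical toy frame with two NAs
toy <- data.frame(credit_score = c(619, NA, 502), age = c(42, 41, NA))
colSums(is.na(toy)) # NA count per column
sum(is.na(toy))     # total NA count; 0 means the frame is clean
```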

Checking the summary of the churn data frame.

summary(churn_data) #looking for summary of the data frame
##   customer_id        credit_score   country             gender         
##  Min.   :15565701   Min.   :350   Length:10000       Length:10000      
##  1st Qu.:15628528   1st Qu.:584   Class :character   Class :character  
##  Median :15690738   Median :652   Mode  :character   Mode  :character  
##  Mean   :15690941   Mean   :651                                        
##  3rd Qu.:15753234   3rd Qu.:718                                        
##  Max.   :15815690   Max.   :850                                        
##       age           tenure         balance       products_number
##  Min.   :18.0   Min.   : 0.00   Min.   :     0   Min.   :1.00   
##  1st Qu.:32.0   1st Qu.: 3.00   1st Qu.:     0   1st Qu.:1.00   
##  Median :37.0   Median : 5.00   Median : 97199   Median :1.00   
##  Mean   :38.9   Mean   : 5.01   Mean   : 76486   Mean   :1.53   
##  3rd Qu.:44.0   3rd Qu.: 7.00   3rd Qu.:127644   3rd Qu.:2.00   
##  Max.   :92.0   Max.   :10.00   Max.   :250898   Max.   :4.00   
##   credit_card    active_member   estimated_salary     churn      
##  Min.   :0.000   Min.   :0.000   Min.   :    12   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.: 51002   1st Qu.:0.000  
##  Median :1.000   Median :1.000   Median :100194   Median :0.000  
##  Mean   :0.706   Mean   :0.515   Mean   :100090   Mean   :0.204  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:149388   3rd Qu.:0.000  
##  Max.   :1.000   Max.   :1.000   Max.   :199992   Max.   :1.000

Credit score: The minimum credit score is 350, the average is 651, and the maximum is 850.

Age: The minimum customer age is 18, the average is 38.9, and the maximum is 92.

Tenure: Customers have been with the bank for an average of 5 years and up to 10 years.

Balance: The maximum account balance is 250898, with an average of 76486.

Country, gender, active_member, credit_card, products_number, and churn are categorical variables.

Customer ID is an identifier and is not an attribute that impacts the churn rate.

churn_data <- churn_data[-c(1)] # removing column 1 (customer_id) from the data set
ncol(churn_data) # counting the number of variables in the data frame
str(churn_data) # checking the structure of the data frame after removing the customer_id field

We are not analyzing individual customers, so customer_id is an irrelevant field in the data set and we removed the customer_id column from the data frame. After removing the customer_id attribute we now have 11 columns.
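Dropping the column by name rather than by position is slightly more robust if the column order ever changes; a minimal sketch on a toy frame:

```r
# Toy frame standing in for churn_data (hypothetical values)
toy <- data.frame(customer_id = 1:3, credit_score = c(619, 608, 502))
toy <- toy[, setdiff(names(toy), "customer_id"), drop = FALSE]
names(toy) # "credit_score" only
```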

Independent variable EDA

Converting required variables into categorical values

str(churn_data)
churn_data_category=churn_data
churn_data_category$credit_card=factor(churn_data$credit_card)
churn_data_category$active_member=factor(churn_data$active_member)
churn_data_category$churn=factor(churn_data$churn)
churn_data_category$gender=factor(churn_data$gender)
churn_data_category$tenure =factor(churn_data$tenure )
churn_data_category$products_number=factor(churn_data$products_number)
churn_data_category$age=factor(churn_data$age)
str(churn_data_category)

We decided to convert these variables into categorical (factor):
credit_card, active_member, churn, gender, products_number, tenure, and age. Notice that the data frame churn_data still holds all variables as numeric, while the data frame churn_data_category contains the categorical columns we just converted.
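The same conversion can be written more compactly with lapply over the column names (a sketch; the column list matches the conversions above, the toy frame is hypothetical):

```r
cat_cols <- c("credit_card", "active_member", "churn", "gender",
              "products_number", "tenure", "age")
# Toy frame standing in for churn_data
toy <- data.frame(credit_card = c(1, 0), gender = c("Female", "Male"),
                  balance = c(0, 83808))
keep <- intersect(cat_cols, names(toy))
toy[keep] <- lapply(toy[keep], factor)
str(toy) # credit_card and gender are now factors; balance stays numeric
```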

Converting Boolean values into Character format

churn_data$active_member[churn_data$active_member== 1]<-"Active"
churn_data$active_member[churn_data$active_member== 0]<-"In Active"

churn_data$credit_card[churn_data$credit_card== 1]<-"Credit Card"
churn_data$credit_card[churn_data$credit_card== 0]<-"No-Credit Card"

churn_data$churn[churn_data$churn== 1]<-"Churned"
churn_data$churn[churn_data$churn== 0]<-"Retained"

head(churn_data)

We assigned string labels to the corresponding Boolean values for these variables in the dataset, so that each variable is easier to understand.
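An alternative is factor() with levels and labels, which performs the same recode in one step (sketch with invented values):

```r
# Hypothetical 0/1 flags standing in for active_member
status <- c(1, 0, 1, 1)
factor(status, levels = c(0, 1), labels = c("In Active", "Active"))
```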

Outliers in the Customer Data

Checking for outliers in the credit_score, age, tenure, balance, and estimated_salary variables using the outlierKD2 function.

credit_score_clean <- outlierKD2(churn_data, credit_score, rm = TRUE, boxplt = TRUE, qqplt = TRUE)

xkablesummary(credit_score_clean)

The proportion of outliers for the credit_score variable is 0.2%, and they were removed successfully using the outlierKD2 function.

age_clean <- outlierKD2(churn_data, age, rm = TRUE, boxplt = TRUE, qqplt = TRUE)

xkablesummary(age_clean)
tenure_clean <- outlierKD2(churn_data, tenure, rm = TRUE, boxplt = TRUE, qqplt = TRUE)

xkablesummary(tenure_clean)
balance_clean <- outlierKD2(churn_data, balance, rm = TRUE, boxplt = TRUE, qqplt = TRUE)

xkablesummary(balance_clean)
salary_clean <- outlierKD2(churn_data, estimated_salary, rm = TRUE, boxplt = TRUE, qqplt = TRUE)

xkablesummary(salary_clean)

Using the outlierKD2 function, we observe that outliers were found only in the age and credit_score variables (3.7% and 0.2% respectively), so we removed them.
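outlierKD-style removal is typically based on the boxplot/IQR rule; a minimal base-R sketch of that rule on invented values (the exact convention used by outlierKD2 is assumed, not confirmed):

```r
set.seed(1)
x <- c(rnorm(100, mean = 650, sd = 97), 200, 1100) # toy scores with two planted extremes
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
outliers <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
sum(outliers)         # the planted extremes are flagged
x_clean <- x[!outliers]
```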

Customer Analysis

Analyzing each variable in the data frame using relevant plots along with the calculated mean, SD, and percentages.

library(ggplot2)
ggplot(data=churn_data, aes(x=credit_score)) + 
  geom_histogram(col="black",bins=30,
                 fill="dark orange", 
                  alpha =.7) + # opacity
  labs(title="Customer's Credit Scores",x="Customer Credit Score", y="Customer Count") 

print(mean(churn_data$credit_score))
print(sd(churn_data$credit_score))

The average credit score is 650.529, most customers' credit scores fall between 600 and 700, and the standard deviation is 96.653.

ggplot(data=churn_data,aes(x=country,fill=country))+
  geom_bar(col="black")+
  scale_fill_brewer(palette="Reds") +
  labs(title = "Bank Customer vs Country",x="Country",y="Customer Count")+
  theme_minimal()

The customers are grouped by the country in which they hold their accounts. As the plot shows, France has more than 50% of the customer accounts, the highest among the three countries, with Germany and Spain holding roughly equal shares.

ggplot(churn_data, aes(x = age,fill=cut(age,100))) + 
  geom_histogram(show.legend = FALSE,col="black",bins=30)+
  scale_fill_discrete(h=c(240,10),c=120,l=70)+
  theme_minimal()+
  labs(x=" Customer Age",y=" Customer Count")+
  ggtitle("Customer ages")

print(mean(churn_data$age))
print(sd(churn_data$age))

The majority of the bank's customers fall below the age of 50, with an average age of 38.922 and a standard deviation of 10.488.

library(ggplot2)
ggplot(data=churn_data, aes(x=tenure)) + 
  geom_histogram(col="black",bins=11,
                 fill="Yellow", 
                  alpha =.7) + 
  labs(title="Years spent as customer",x="Number of years", y="Customer Count") 

Most customers have been with the bank for more than a year.

df <- churn_data %>% 
  group_by(gender) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
str(df)

ggplot(df, aes(x = "", y = perc, fill =gender )) +
  geom_col(color="black") +
  geom_text(aes(label = labels),color = c("black", "black"),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y")+
  labs(title="Percentage of Male and Female customers in bank",x="", y="") 

Male customers form the majority at 55%; the remaining 45% are female.

df <- churn_data %>% 
  group_by(active_member) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
str(df)


ggplot(df, aes(x = "", y = perc, fill = active_member)) +
  geom_col(color="black") +
  geom_label(aes(label = labels),color = c("black", "white"),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")+
  scale_fill_grey()+
  labs(title="Percentage of active members in bank",x="", y="") 

A large share of customers is inactive: 48.5% of customers are reported as inactive.

df <- churn_data %>% 
  group_by(credit_card) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
str(df)

ggplot(df, aes(x = "", y = perc, fill = credit_card)) +
  geom_col(color="black") +
  geom_label(aes(label = labels),color = c("black", "black"),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")+
  scale_fill_brewer(palette="Greens")+
  labs(title="Percentage of customer use credit card",x="", y="") 

Predominantly, 71% of the bank's customers use a credit card, and only 29% do not.

df <- churn_data %>% 
  group_by(products_number) %>% # Variable to be transformed
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
str(df)

ggplot(df, aes(x = "", y = perc, fill = products_number)) +
  geom_col(color="black") +
  geom_label(aes(label = labels),color = c("white", "white","white","white"),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")+
  labs(title="Percentage of  different products used by customers ",x="", y="") 

Most customers hold a single product (about 50%), while four products is the least common with 0.6%.

df <- churn_data %>% 
  group_by(churn) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
str(df)

ggplot(df, aes(x = "", y = perc, fill = churn)) +
  geom_col(color="black") +
  geom_label(aes(label = labels),color = c("black", "black"),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")+
  scale_fill_brewer(palette="Purples")+
  labs(title="Percentage of customer retained",x="", y="") 

The bank managed to retain 80% of its customers, while the remaining 20% churned.

Inferential statistics

Inferential statistics for customer credit scores

mean.credit_score = mean(churn_data$credit_score, na.rm = TRUE); #mean of the customer credit scores
str(mean.credit_score)
##  num 651
sd_credit_score = sd(churn_data$credit_score)
str(sd_credit_score)
##  num 96.7

Mean and SD for the entire population of credit_score in the churn_data data frame: mean = 650.53 and sd = 96.653.

Now let us create a smaller sample of the credit score out of the entire population, and call it churn_credit_sample.

set.seed(321) 
churn_credit_sample = churn_data[ sample(nrow(churn_data),4000), ]
format(mean(churn_credit_sample$credit_score), digits=5)
## [1] "650.19"
loadPkg("BSDA") 

ztestcredit95 = z.test(churn_credit_sample$credit_score, sigma.x = 96.7) # default conf.level = 0.95
ztestcredit95
ztestcredit99 = z.test(churn_credit_sample$credit_score, sigma.x = 96.7, conf.level=0.99 )
ztestcredit99
ztestcredit50 = z.test(churn_credit_sample$credit_score, sigma.x = 96.7, conf.level=0.50 )
ztestcredit50

ztestcredit95$conf.int
ztestcredit99$conf.int
ztestcredit50$conf.int
loadPkg("BSDA") 

ttestcredit95 = t.test(churn_credit_sample$credit_score) # default conf.level = 0.95
ttestcredit95
ttestcredit99 = t.test(churn_credit_sample$credit_score, conf.level=0.99 )
ttestcredit99
ttestcredit50 = t.test(churn_credit_sample$credit_score,  conf.level=0.50 )
ttestcredit50

ttestcredit95$conf.int
ttestcredit99$conf.int
ttestcredit50$conf.int
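The z-intervals above can be reproduced by hand from mean ± z·σ/√n; a minimal sketch, with the sample mean and σ taken from the output above:

```r
# 95% z-interval: xbar +/- qnorm(0.975) * sigma / sqrt(n)
xbar <- 650.19; sigma <- 96.7; n <- 4000
ci95 <- xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)
round(ci95, 2) # about (647.19, 653.19)
```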

Inferential statistics for customer account balance.

mean.balance= mean(churn_data$balance, na.rm = TRUE); #mean of the customer account balance
str(mean.balance)
##  num 76486
sd_balance = sd(churn_data$balance)
str(sd_balance)
##  num 62397

Mean and SD for the entire population of balance in the churn_data data frame: mean = 76486 and sd = 62397.

Now let us create a smaller sample of the account balance out of the entire population, and call it churn_balance_sample.

set.seed(321) 
churn_balance_sample = churn_data[ sample(nrow(churn_data),4000), ]
format(mean(churn_balance_sample$balance), digits=5)
## [1] "75942"
loadPkg("BSDA") 

ztestbalance95 = z.test(churn_balance_sample$balance, sigma.x = 62397.4) # default conf.level = 0.95
ztestbalance95
ztestbalance99 = z.test(churn_balance_sample$balance, sigma.x = 62397.4, conf.level=0.99 )
ztestbalance99
ztestbalance50 = z.test(churn_balance_sample$balance, sigma.x = 62397.4, conf.level=0.50 )
ztestbalance50

ztestbalance95$conf.int
ztestbalance99$conf.int
ztestbalance50$conf.int
loadPkg("BSDA") 

ttestbalance95 = t.test(churn_balance_sample$balance) # default conf.level = 0.95
ttestbalance95
ttestbalance99 = t.test(churn_balance_sample$balance, conf.level=0.99 )
ttestbalance99
ttestbalance50 = t.test(churn_balance_sample$balance,  conf.level=0.50 )
ttestbalance50

ttestbalance95$conf.int
ttestbalance99$conf.int
ttestbalance50$conf.int

Churned Customer Analysis

Mid-term Analysis

SMART Questions:

Does churn depend on Gender?

We perform a chi-square test to determine whether the churn rate depends on gender.

Chi Square Test for independence

H0: Customer’s gender and churn are independent

H1: Customer’s gender and churn are not independent

#Contingency table
contab = table(churn_data$churn, churn_data$gender)
xkabledply(contab, title="Contingency table for Gender(Male, Female) vs Churn (0 or 1)")

#Chisquare Test of independency
chitests = chisq.test(contab)
chitests

chitests$statistic
chitests$parameter
chitests$p.value

We have a \(\chi^2\) value of 112.919 from chitests$statistic, and the p-value from chitests$p.value is \(2.248 \times 10^{-26}\), which is less than the significance level of 0.05. Hence we reject the null hypothesis H0 and conclude that the churn rate depends on the gender of the customer.

ggplot(data=churn_data,aes(x=gender,fill=churn))+
  geom_bar(col="black")+
   scale_fill_manual('Position', values=c('red', 'lightblue'))+
  labs(title = "Gender vs Churn",x="Gender",y="Churn")+
  theme_minimal()

We observed that majority of the churned customers were female despite the total population of the bank being predominantly male.
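For reference, the mechanics of the independence test on a small 2x2 table (counts invented, not the bank data):

```r
# Toy contingency table; chisq.test applies Yates' continuity correction for 2x2
tab <- matrix(c(50, 30, 20, 60), nrow = 2,
              dimnames = list(churn = c("Churned", "Retained"),
                              gender = c("Female", "Male")))
res <- chisq.test(tab)
res$expected        # expected counts under independence
res$p.value < 0.05  # TRUE here: reject independence at the 5% level
```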

Does a user having a credit card with the bank affect their predisposition to churn?

Chi Square Test for independence

H0: Having a credit card and churn are independent

H1: Having a credit card and churn are not independent

# churn and credit_card were already relabeled above ("Churned"/"Retained",
# "Credit Card"/"No-Credit Card"), so no further recoding is needed here

cc_churn <- table(churn_data$churn, churn_data$credit_card)
cc_churn

chitestccchurn <- chisq.test(cc_churn)
chitestccchurn



ggplot(data=churn_data,aes(x=credit_card,fill=churn))+
  geom_bar()+
  scale_fill_manual('Position', values=c('red', 'lightblue')) +
  labs(title = "Customer Churn in Customers With and Without Credit Cards",x="Credit Card",y="Churn")+
  theme_minimal()

Credit Card vs Churn: the p-value of 0.5 is much higher than the significance level (0.05 for df = 1). Thus, we fail to reject the null hypothesis H0, as there is a 0.5 or 50% chance of these results occurring by chance; we find no evidence that having a credit card is associated with churn.

Does the churn rate depend upon which country the customer belongs to?

Chi Square Test for independence

H0: Country and churn are independent

H1: Country and churn are not independent

#Contingency table

contable = table(churn_data$country, churn_data$churn)
xkabledply(contable, title="Contingency table for Country (Customer belongs to) vs Churn (0 or 1)")

#Chisquare Test of independency
chitest <- chisq.test(contable)
chitest

chitest$statistic
chitest$parameter
chitest$p.value

The \(\chi^2\) statistic of the test is available from chitest$statistic, and the p-value from chitest$p.value is far below the significance level of 0.05. As a result, we reject the null hypothesis H0 and conclude that the churn rate depends on the customer's country.

ggplot(data=churn_data,aes(x=country,fill=churn))+
  geom_bar()+
  scale_fill_manual('Position', values=c('red', 'lavenderblush2')) +
  labs(title = "Customer Churn in Different Countries",x="Country",y="Churn")+
  theme_minimal()

We can see that customers from Germany and France account for most of the churn when compared with customers from Spain.

Does churn depend on the different banking products and services provided by the bank?

Chi Square Test for independence

H0: Banking products, services and churn are independent

H1: Banking products, services and churn are not independent

#Contingency table
contab_productnumber = table(churn_data$churn, churn_data$products_number)
xkabledply(contab_productnumber, title="Contingency table for Banking Services vs Churn (0 or 1)")

#Chisquare Test of independency
chitests = chisq.test(contab_productnumber)
chitests

chitests$statistic
chitests$parameter

We have a \(\chi^2\) value of 1503.629 from chitests$statistic, and the p-value from chitests$p.value is effectively 0, which is less than the significance level of 0.05. As a result, we reject the null hypothesis H0 and conclude that the churn rate depends on the number of banking products and services the customer uses.

ggplot(data=churn_data,aes(x=products_number,fill=churn))+
  geom_bar()+
  scale_fill_manual('Position', values=c('red', 'lavender')) +
  labs(title = "Banking Product and Services vs Churn",x="Banking Product and Services",y="Churn")+
  theme_minimal()

We can see that customers with one, three, or four products churn more than customers with two products.

Does churn depend on status of active account users?

Chi Square Test for independence

H0: Customer’s active status and churn are independent

H1: Customer’s active status and churn are not independent

#Contingency table
contab_activemember = table(churn_data$churn, churn_data$active_member)
xkabledply(contab_activemember, title="Contingency table for Account Status(Active, In Active) vs Churn (0 or 1)")

#Chisquare Test of independency
chitests = chisq.test(contab_activemember)
chitests

chitests$statistic
chitests$parameter
chitests$p.value

We have a \(\chi^2\) value of 242.985 from chitests$statistic, and the p-value from chitests$p.value is \(8.786 \times 10^{-55}\), which is less than the significance level of 0.05. As a result, we reject the null hypothesis H0 and conclude that the churn rate depends on the customer's active status.

ggplot(data=churn_data,aes(x=active_member,fill=churn))+
  geom_bar()+
  scale_fill_manual('Position', values=c('red', 'pink1')) +
  labs(title = "Active Members vs Churn",x="Active Status of Account Holders",y="Churn")+
  theme_minimal()

Here we can see that most of the churned customers were inactive.

Does a higher/lower account balance of the customer affect churn?

Correlation test between account balance and churn

library(corrplot)
churn_data_to_numeric <- churn_data
# churn was relabeled "Churned"/"Retained" above, so map it back to 1/0
# before computing correlations (as.numeric on the labels would give NAs)
churn_data_to_numeric$churn <- ifelse(churn_data_to_numeric$churn == "Churned", 1, 0)

churn_data_numeric <- select_if(churn_data_to_numeric, is.numeric)

cors_bal <- cor(churn_data_numeric)
corrplot(cors_bal, method = 'number')

cor_balance <- cor.test(churn_data_numeric$balance, churn_data_numeric$churn, method = "pearson")
cor_balance

Customer churn is weakly correlated with account balance, with a score of 0.12. For further analysis we can check whether the correlation score improves when combining multiple variables.
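For context, a Pearson correlation against a 0/1 outcome, as computed here, is the point-biserial correlation; a tiny sketch with invented values:

```r
x <- c(10, 20, 30, 40, 50) # hypothetical balances
y <- c(0, 0, 1, 1, 1)      # hypothetical churn flags
cor(x, y)                  # ~0.87 for these toy values
```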

loadPkg("BSDA") 

ttestbalances95 = t.test(churn_balance_sample$credit_score) # default conf.level = 0.95
ttestbalances95
ttestbalances99 = t.test(churn_balance_sample$credit_score, conf.level=0.99 )
ttestbalances99
ttestbalances50 = t.test(churn_balance_sample$credit_score,  conf.level=0.50 )
ttestbalances50

ttestbalances95$conf.int
ttestbalances99$conf.int
ttestbalances50$conf.int

Does a higher/lower credit score of the customer affect churn?

Correlation test between credit score and churn

library(corrplot)

# map the "Churned"/"Retained" labels back to 1/0 before copying,
# so the copy can be converted with as.numeric without producing NAs
churn_data$churn[churn_data$churn == "Churned"] <- 1
churn_data$churn[churn_data$churn == "Retained"] <- 0

churn_data_to_numeric <- churn_data
churn_data_to_numeric$churn <- as.numeric(churn_data_to_numeric$churn) 
str(churn_data_to_numeric)
churn_data_numeric <- select_if(churn_data_to_numeric, is.numeric)
str(churn_data_numeric)
cors_credit <- cor(churn_data_numeric)
corrplot(cors_credit, method = 'number')

cor_scores <- cor(churn_data_numeric$credit_score, churn_data_numeric$churn)
cor_scores

The credit score is weakly correlated with customer churn, with a score of -0.0271.

Does churn depend on age of the customers?

library(corrplot)
# churn was already mapped back to 1/0 above, so the copy converts directly
churn_data_to_numeric <- churn_data
churn_data_to_numeric$churn <- as.numeric(churn_data_to_numeric$churn) 
str(churn_data_to_numeric)
churn_data_numeric <- select_if(churn_data_to_numeric, is.numeric)
str(churn_data_numeric)
cors_age_group <- cor(churn_data_numeric)
corrplot(cors_age_group, method = 'number')

cor_age <- cor(churn_data_numeric$age, churn_data_numeric$churn)
cor_age
loadPkg("BSDA") 

ttestage95 = t.test(churn_balance_sample$age) # default conf.level = 0.95
ttestage95
ttestage99 = t.test(churn_balance_sample$age, conf.level=0.99 )
ttestage99
ttestage50 = t.test(churn_balance_sample$age,  conf.level=0.50 )
ttestage50

ttestage95$conf.int
ttestage99$conf.int
ttestage50$conf.int
ggplot(churn_data, aes(x=age, fill=churn)) +
  geom_histogram( color='#e9ecef', alpha=0.5, position='identity') + scale_fill_manual('Position', values=c('red', 'pink1')) +
  labs(title = "Age Distribution across Churned and No Churn Account Holders",x="Age of Account Holders",y="Frequency")

#Contingency table
churn_age = table(churn_data$churn, churn_data$age)
churn_age

#Chisquare Test of independence
chichurn_age = chisq.test(churn_age)
chichurn_age

chichurn_age$statistic
chichurn_age$parameter
chichurn_age$p.value

Final Project Analysis

We observed that the age variable has the highest correlation with churn, so we combined different columns to see whether the correlation score improves.

corre<-cor(churn_data_numeric)
corre
corrplot(corre, method = "number")

Churn and age are moderately positively correlated (0.29).

corr <- cor(churn_data_numeric$age, churn_data_numeric$churn)
corr
## [1] 0.285
corr1 <- cor(churn_data_numeric$age + churn_data_numeric$tenure, churn_data_numeric$churn)
corr1
## [1] 0.272
corr2 <- cor(churn_data_numeric$age + churn_data_numeric$products_number, churn_data_numeric$churn)
corr2
## [1] 0.283
corr3 <- cor(churn_data_numeric$age + churn_data_numeric$balance, churn_data_numeric$churn)
corr3
## [1] 0.119
corr4 <- cor(churn_data_numeric$age + churn_data_numeric$tenure + churn_data_numeric$products_number, churn_data_numeric$churn)
corr4
## [1] 0.269
corr5 <- cor(churn_data_numeric$age + churn_data_numeric$products_number + churn_data_numeric$balance, churn_data_numeric$churn)
corr5
## [1] 0.119

The correlation score is highest for the age variable alone. After combining different columns and recomputing the correlation, the score only decreases from 0.285.
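One likely reason the combined sums score lower is scale: balance is orders of magnitude larger than age, so the raw sum is dominated by balance. A sketch on simulated data (assumption: standardizing before summing; the variable names are hypothetical):

```r
set.seed(1)
a <- rnorm(100)              # e.g. a standardized age-like variable
b <- rnorm(100) * 1000       # e.g. a balance-like variable on a much larger scale
y <- a + rnorm(100) * 0.5    # outcome actually driven by a
cor(a + b, y)                           # diluted: b's scale dominates the sum
cor(as.numeric(scale(a) + scale(b)), y) # noticeably higher after standardizing
```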

Since the data set is unbalanced, we decided to use an oversampling technique to balance it, using the ROSE library.

The data set was unbalanced largely because most of the records in the country variable belong to France, hence we need to perform data balancing.
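ROSE's ovun.sample with method = "over" resamples minority-class rows with replacement; a minimal base-R sketch of the idea on toy data (ovun.sample itself does this more carefully):

```r
set.seed(123)
toy <- data.frame(churn = c(rep("Retained", 8), rep("Churned", 2)))
minority <- toy[toy$churn == "Churned", , drop = FALSE]
# draw 6 extra minority rows with replacement to reach 8 vs 8
extra <- minority[sample(nrow(minority), 6, replace = TRUE), , drop = FALSE]
balanced <- rbind(toy, extra)
table(balanced$churn) # 8 Churned vs 8 Retained
```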

table(churn_data$churn)
table(churn_data$country)
library(ROSE)

sum(churn_data$country == 'France')
## [1] 5014
sum(churn_data$country == 'Germany')
## [1] 2509
sum(churn_data$country == 'Spain')
## [1] 2477
ind <- sample(2, nrow(churn_data), replace = TRUE, prob = c(0.7, 0.3))
train <- churn_data[ind==1,]
test <- churn_data[ind==2,]
data_balanced_over <- ovun.sample(churn ~ ., data = train, method = "over",N = 10000,seed=123)$data
table(data_balanced_over$churn)
table(data_balanced_over$country)
ggplot(data=data_balanced_over,aes(x=country,fill=country))+
  geom_bar(col="black")+
  scale_fill_brewer(palette="Reds") +
  labs(title = "Bank Customer vs Country",x="Country",y="Customer Count")+
  theme_minimal()

Model Building and Training

Let us look at the proportion of customers who churned versus those who did not.

prop.table( table(churn_data$churn) )

This probability table tells us that around 20 percent of the customers churned. If those are the customers with a good relationship with the bank, the bank's business will take a severe hit. Let's use different modelling methods to train on our dataset and see if we can find out what causes customers to close their accounts with this bank.

str(churn_data)
churn_data_pc <- data_balanced_over
# recode using conditions on churn_data_pc itself: conditions on churn_data
# would misalign rows, since data_balanced_over was resampled from the training split
churn_data_pc$country[churn_data_pc$country == 'France'] <- 0
churn_data_pc$country[churn_data_pc$country == 'Germany'] <- 1
churn_data_pc$country[churn_data_pc$country == 'Spain'] <- 2

churn_data_pc$gender[churn_data_pc$gender == 'Male'] <- 0
churn_data_pc$gender[churn_data_pc$gender == 'Female'] <- 1

churn_data_pc$credit_card[churn_data_pc$credit_card == 'No-Credit Card'] <- 0
churn_data_pc$credit_card[churn_data_pc$credit_card == 'Credit Card'] <- 1

churn_data_pc$active_member[churn_data_pc$active_member == 'In Active'] <- 0
churn_data_pc$active_member[churn_data_pc$active_member == 'Active'] <- 1

churn_data_pc$churn[churn_data_pc$churn == 'Churned'] <- 1
churn_data_pc$churn[churn_data_pc$churn == 'Retained'] <- 0

churn_data_pc$active_member = as.numeric(churn_data_pc$active_member)
churn_data_pc$credit_card = as.numeric(churn_data_pc$credit_card)
churn_data_pc$country = as.numeric(churn_data_pc$country)
churn_data_pc$credit_score = as.numeric(churn_data_pc$credit_score)
churn_data_pc$gender = as.numeric(churn_data_pc$gender)
churn_data_pc$churn = as.numeric(churn_data_pc$churn)

churn_data_pc_target = churn_data_pc #includes 'churn'
churn_data_logit <- churn_data_pc_target
str(churn_data_logit)
churn_data_pc <- churn_data_logit[-c(11)] #No churn
str(churn_data_pc)
str(churn_data_pc_target)
str(churn_data_pc)
churn_data_scale <- data.frame(scale(churn_data_pc))
xkabledply(cor(churn_data_scale))
xkabledply(cov(churn_data_scale))
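Although churn_data_logit is prepared above, no logistic regression is fit in this section; a hedged sketch of what such a baseline could look like, on simulated toy data (the column names and coefficients are invented, not estimates from the bank data):

```r
set.seed(7)
toy <- data.frame(age = rnorm(200, 40, 10),
                  balance = rnorm(200, 76000, 60000))
# simulate a churn flag whose probability rises with age (hypothetical effect)
toy$churn <- rbinom(200, 1, plogis(-4 + 0.09 * toy$age))
fit <- glm(churn ~ age + balance, data = toy, family = binomial)
summary(fit)$coefficients
```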

SMART QUESTIONS

Does the combination of customer demographic variables country, age group, gender influence the churn rate?

churnLM <- lm(churn~country+age+gender,data=churn_data_pc_target)
summary(churnLM)
## 
## Call:
## lm(formula = churn ~ country + age + gender, data = churn_data_pc_target)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.268 -0.373 -0.214  0.463  0.927 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.218369   0.019199  -11.37   <2e-16 ***
## country     -0.005953   0.005610   -1.06     0.29    
## age          0.016364   0.000436   37.58   <2e-16 ***
## gender      -0.013606   0.009324   -1.46     0.14    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.464 on 9996 degrees of freedom
## Multiple R-squared:  0.124,  Adjusted R-squared:  0.124 
## F-statistic:  471 on 3 and 9996 DF,  p-value: <2e-16
xkabledply(churnLM, title = "churn Model summary")
churn Model summary

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -0.2184      0.0192   -11.37     0.000
country      -0.0060      0.0056    -1.06     0.289
age           0.0164      0.0004    37.58     0.000
gender       -0.0136      0.0093    -1.46     0.144
confint(churnLM)
##               2.5 %   97.5 %
## (Intercept) -0.2560 -0.18074
## country     -0.0170  0.00504
## age          0.0155  0.01722
## gender      -0.0319  0.00467

Whether the customers using fewer products with higher salary,account balance and status of the account affect the churn rate?

churnLM2 <- lm(churn~products_number+estimated_salary+balance+active_member,data=churn_data_pc_target)
summary(churnLM2)
## 
## Call:
## lm(formula = churn ~ products_number + estimated_salary + balance + 
##     active_member, data = churn_data_pc_target)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.633 -0.460 -0.325  0.521  0.704 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.17e-01   1.81e-02   17.51   <2e-16 ***
## products_number  -6.23e-03   7.58e-03   -0.82    0.411    
## estimated_salary  1.86e-07   8.54e-08    2.18    0.029 *  
## balance           1.24e-06   8.18e-08   15.13   <2e-16 ***
## active_member     1.54e-02   9.80e-03    1.57    0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 9995 degrees of freedom
## Multiple R-squared:  0.0247, Adjusted R-squared:  0.0243 
## F-statistic: 63.2 on 4 and 9995 DF,  p-value: <2e-16
xkabledply(churnLM2, title = "churn Model summary")
churn Model summary

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        0.3175      0.0181   17.515    0.0000
products_number   -0.0062      0.0076   -0.823    0.4107
estimated_salary   0.0000      0.0000    2.183    0.0291
balance            0.0000      0.0000   15.131    0.0000
active_member      0.0154      0.0098    1.566    0.1173
confint(churnLM2)
##                      2.5 %   97.5 %
## (Intercept)       2.82e-01 3.53e-01
## products_number  -2.11e-02 8.62e-03
## estimated_salary  1.90e-08 3.54e-07
## balance           1.08e-06 1.40e-06
## active_member    -3.86e-03 3.46e-02

What are the principal components for predicting churn rate?

pr.out=prcomp(churn_data_pc, scale =TRUE)
print("Case: z-score/scaled")
summary(pr.out)
pr.out$rotation
nrow(train)
prop.table( table(train$churn) )
biplot(pr.out, scale = 0)

biplot(pr.out, 2:3, scale = 0)  # PC2 vs PC3

biplot(pr.out, 3:4, scale = 0)  # PC3 vs PC4

pr.var <- pr.out$sdev^2      # variance of each principal component
pve <- pr.var / sum(pr.var)  # proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component (standardized)",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
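A common way to read the cumulative-PVE plot is to pick the smallest number of components that crosses a chosen variance threshold. The sketch below illustrates that rule on simulated data, not the churn data frame; the 80% cutoff and the six-column matrix are arbitrary choices for illustration.

```r
# Illustrative only: simulated data, not the churn data.
set.seed(42)
X <- matrix(rnorm(200 * 6), ncol = 6)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.1)  # correlated columns so PC1 dominates

pr  <- prcomp(X, scale. = TRUE)
pve <- pr$sdev^2 / sum(pr$sdev^2)        # proportion of variance explained
k   <- which(cumsum(pve) >= 0.80)[1]     # smallest k reaching 80% cumulative PVE
k
```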

Which model can give the best results for adjusted R square and ROC value with lower AIC and BIC?

Feature selection using Exhaustive search

loadPkg("leaps")

reg.churn <- regsubsets(churn ~ ., data = churn_data_pc_target, nvmax = 9, nbest = 1, method = "exhaustive")
plot(reg.churn, scale = "adjr2", main = "Adjusted R^2")

plot(reg.churn, scale = "r2", main = "R^2")

plot(reg.churn, scale = "bic", main = "BIC")

plot(reg.churn, scale = "Cp", main = "Cp")

summary(reg.churn)

Exhaustive search selected age and balance, a two-variable model with an adjusted R^2 of 0.14.

The best model selected by BIC is credit_score, age, and balance, a three-variable model with a BIC of about -1500.

The best model selected by Cp is credit_score, age, tenure, balance, and estimated_salary, a five-variable model with a Cp of about 7.

loadPkg("car")

# Adjusted R2
car::subsets(reg.churn, statistic="adjr2", legend = FALSE, min.size = 2, main = "Adjusted R^2")

subsets(reg.churn, statistic="cp", legend = FALSE, min.size = 4, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 3)  

The Mallows Cp plot selected two best models:

1) credit_score, country, gender, age, tenure, balance, active_member, estimated_salary (an eight-variable model) 2) credit_score, country, gender, age, tenure, balance, estimated_salary (a seven-variable model).
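For intuition, what `regsubsets(..., method = "exhaustive")` does can be sketched in base R: enumerate every predictor subset, fit each with `lm()`, and keep the subset with the best adjusted R^2. The sketch below uses the built-in `mtcars` data rather than the churn data, and the helper name `best_subset` is hypothetical.

```r
# Minimal exhaustive subset search, base R only (mtcars, not the churn data).
best_subset <- function(resp, data) {
  preds <- setdiff(names(data), resp)
  best  <- list(adjr2 = -Inf, vars = character(0))
  for (k in seq_along(preds)) {
    for (vars in combn(preds, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = resp), data = data)
      a   <- summary(fit)$adj.r.squared
      if (a > best$adjr2) best <- list(adjr2 = a, vars = vars)
    }
  }
  best
}

res <- best_subset("mpg", mtcars)
res$vars    # predictors of the best adjusted-R^2 model
res$adjr2
```

This is O(2^p) fits, which is why `regsubsets` uses branch-and-bound internally; the brute-force loop is only practical for small p.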

Feature selection using forward search

reg.churnforward <- regsubsets(churn ~ ., data = churn_data_pc_target, nvmax = 9, nbest = 1, method = "forward")
plot(reg.churnforward, scale = "adjr2", main = "Adjusted R^2")

plot(reg.churnforward, scale = "bic", main = "BIC")

plot(reg.churnforward, scale = "Cp", main = "Cp")

summary(reg.churnforward)

The best model selected by forward search is age and balance, a two-variable model with an adjusted R^2 of 0.14.

The best model selected by BIC is credit_card, age, and balance, a three-variable model with a BIC of about -1500.

The best model selected by Cp is credit_score, age, tenure, balance, and estimated_salary, a five-variable model with a Cp of about 7.

Feature selection using backward search

reg.churnback <- regsubsets(churn~., data = churn_data_pc_target, method = "backward", nvmax = 9, nbest = 1)
plot(reg.churnback, scale = "adjr2", main = "Adjusted R^2")

plot(reg.churnback, scale = "bic", main = "BIC")

plot(reg.churnback, scale = "Cp", main = "Cp")

summary(reg.churnback)

The best model selected by backward search is age and balance, a two-variable model with an adjusted R^2 of 0.14.

The best model selected by BIC is credit_card, age, and balance, a three-variable model with a BIC of about -1500.

The best model selected by Cp is credit_score, age, tenure, balance, and estimated_salary, a five-variable model with a Cp of about 7.

Feature selection using Sequential Replacement

reg.churnseqrep <- regsubsets(churn ~ ., data = churn_data_pc_target, nvmax = 9, nbest = 1, method = "seqrep")
plot(reg.churnseqrep, scale = "adjr2", main = "Adjusted R^2")

plot(reg.churnseqrep, scale = "bic", main = "BIC")

plot(reg.churnseqrep, scale = "Cp", main = "Cp")

The best model selected by sequential replacement is age and balance, a two-variable model with an adjusted R^2 of 0.14.

The best model selected by BIC is credit_card, age, balance, and estimated_salary, a four-variable model with a BIC of about -1500.

The best model selected by Cp is credit_score, age, tenure, balance, and estimated_salary, a five-variable model with a Cp of about 7.

Logistic Regression model

What factors affect the customer churn rate significantly?

churn_logit <- glm(churn ~., data = churn_data_logit, family = 'binomial')

summary(churn_logit)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.758  -0.944  -0.640   1.079   2.183  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.12e+00   1.97e-01  -15.87  < 2e-16 ***
## credit_score     -8.49e-04   2.24e-04   -3.79  0.00015 ***
## country          -3.36e-02   2.64e-02   -1.27  0.20351    
## gender           -6.32e-02   4.39e-02   -1.44  0.14990    
## age               7.47e-02   2.29e-03   32.57  < 2e-16 ***
## tenure           -1.56e-02   7.47e-03   -2.09  0.03657 *  
## balance           5.19e-06   3.70e-07   14.04  < 2e-16 ***
## products_number  -2.21e-02   3.32e-02   -0.67  0.50570    
## credit_card      -4.12e-02   4.80e-02   -0.86  0.39019    
## active_member     4.76e-02   4.37e-02    1.09  0.27600    
## estimated_salary  1.01e-06   3.81e-07    2.65  0.00817 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13698  on 9999  degrees of freedom
## Residual deviance: 12152  on 9989  degrees of freedom
## AIC: 12174
## 
## Number of Fisher Scoring iterations: 4
xkabledply( confint(churn_logit), title = "CIs using profiled log-likelihood" )
CIs using profiled log-likelihood
                     2.5 %   97.5 %
(Intercept)        -3.5122  -2.7400
credit_score       -0.0013  -0.0004
country            -0.0854   0.0181
gender             -0.1493   0.0228
age                 0.0703   0.0793
tenure             -0.0303  -0.0010
balance             0.0000   0.0000
products_number    -0.0872   0.0430
credit_card        -0.1351   0.0528
active_member      -0.0380   0.1333
estimated_salary    0.0000   0.0000
xkabledply( confint.default(churn_logit), title = "CIs using standard errors" )
CIs using standard errors
                     2.5 %   97.5 %
(Intercept)        -3.5109  -2.7388
credit_score       -0.0013  -0.0004
country            -0.0853   0.0182
gender             -0.1493   0.0228
age                 0.0702   0.0792
tenure             -0.0303  -0.0010
balance             0.0000   0.0000
products_number    -0.0872   0.0430
credit_card        -0.1352   0.0528
active_member      -0.0381   0.1333
estimated_salary    0.0000   0.0000
loadPkg("regclass")
xkabledply( confusion_matrix(churn_logit), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
          Predicted 0  Predicted 1  Total
Actual 0         4589         1052   5641
Actual 1         1959         2400   4359
Total            6548         3452  10000
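The counts in this confusion matrix translate directly into headline classification metrics. The values below are computed by hand from the table above, with churn (1) as the positive class.

```r
# Counts taken from the confusion matrix above.
tn <- 4589; fp <- 1052; fn <- 1959; tp <- 2400

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 0.699
precision <- tp / (tp + fp)                   # 0.695
recall    <- tp / (tp + fn)                   # 0.551

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

The low recall shows the model misses almost half of the customers who actually churn, which the AUC below summarizes from a threshold-free perspective.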
loadPkg("pROC") 
prob=predict(churn_logit, type = "response" )
churn_data_logit$prob=prob
h <- roc(churn~prob, data = churn_data_logit)
auc(h) 
## Area under the curve: 0.745
plot(h)
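The AUC that `auc()` reports is equivalent to the Mann-Whitney rank statistic: the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. A base-R sketch on simulated scores (not the fitted model above) shows the computation:

```r
# Simulated outcomes and scores, for illustration only.
set.seed(1)
y     <- rbinom(500, 1, 0.4)       # 0/1 outcomes
score <- y * 0.5 + rnorm(500)      # scores mildly correlated with y

r  <- rank(score)
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc_manual <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc_manual  # above 0.5 for an informative score
```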

Based on the models obtained from feature selection, we now fit regression models to predict customer churn. Note that the `glm()` calls below omit `family = 'binomial'`, so R defaults to the gaussian family: these fits are linear probability models rather than true logistic regressions (the output confirms the gaussian dispersion parameter), although the confusion matrices and ROC curves are computed the same way.

churnglm1 <- glm(churn ~  age + balance, data = churn_data_logit)
summary(churnglm1)
## 
## Call:
## glm(formula = churn ~ age + balance, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.302  -0.371  -0.190   0.453   0.952  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.05e-01   1.88e-02   -16.2   <2e-16 ***
## age          1.60e-02   4.32e-04    37.1   <2e-16 ***
## balance      1.10e-06   7.51e-08    14.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.211)
## 
##     Null deviance: 2458.9  on 9999  degrees of freedom
## Residual deviance: 2109.6  on 9997  degrees of freedom
## AIC: 12826
## 
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm1), title = "Confusion matrix from Logit Model-1" )
Confusion matrix from Logit Model-1
          Predicted 0  Predicted 1  Total
Actual 0         4633         1008   5641
Actual 1         2026         2333   4359
Total            6659         3341  10000
prob1=predict(churnglm1, type = "response" )
churn_data_logit$prob=prob1
h1 <- roc(churn~prob, data = churn_data_logit)
auc(h1) 
## Area under the curve: 0.744
plot(h1)

churnglm2 <- glm(churn ~  credit_card+age + balance, data = churn_data_logit)
summary(churnglm2)
## 
## Call:
## glm(formula = churn ~ credit_card + age + balance, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.299  -0.372  -0.189   0.453   0.955  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.98e-01   2.01e-02  -14.80   <2e-16 ***
## credit_card -9.25e-03   1.01e-02   -0.92     0.36    
## age          1.60e-02   4.32e-04   37.09   <2e-16 ***
## balance      1.10e-06   7.52e-08   14.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.211)
## 
##     Null deviance: 2458.9  on 9999  degrees of freedom
## Residual deviance: 2109.4  on 9996  degrees of freedom
## AIC: 12827
## 
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm2), title = "Confusion matrix from Logit Model-2" )
Confusion matrix from Logit Model-2
          Predicted 0  Predicted 1  Total
Actual 0         4622         1019   5641
Actual 1         1996         2363   4359
Total            6618         3382  10000
prob2=predict(churnglm2, type = "response" )
churn_data_logit$prob=prob2
h2 <- roc(churn~prob, data = churn_data_logit)
auc(h2) 
## Area under the curve: 0.744
plot(h2)

churnglm3 <- glm(churn ~ credit_score+age+tenure+ balance + estimated_salary, data = churn_data_logit)
summary(churnglm3)
## 
## Call:
## glm(formula = churn ~ credit_score + age + tenure + balance + 
##     estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.310  -0.372  -0.182   0.455   0.984  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.91e-01   3.78e-02   -5.05  4.5e-07 ***
## credit_score     -1.83e-04   4.70e-05   -3.89  9.9e-05 ***
## age               1.60e-02   4.31e-04   37.14  < 2e-16 ***
## tenure           -3.29e-03   1.57e-03   -2.10   0.0359 *  
## balance           1.10e-06   7.51e-08   14.70  < 2e-16 ***
## estimated_salary  2.10e-07   8.00e-08    2.63   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.211)
## 
##     Null deviance: 2458.9  on 9999  degrees of freedom
## Residual deviance: 2103.9  on 9994  degrees of freedom
## AIC: 12805
## 
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm3), title = "Confusion matrix from Logit Model-3" )
Confusion matrix from Logit Model-3
          Predicted 0  Predicted 1  Total
Actual 0         4624         1017   5641
Actual 1         2022         2337   4359
Total            6646         3354  10000
prob3=predict(churnglm3, type = "response" )
churn_data_logit$prob=prob3
h3 <- roc(churn~prob, data = churn_data_logit)
auc(h3) 
## Area under the curve: 0.745
plot(h3)

churnglm4 <- glm(churn ~ credit_score + country + gender + age + tenure + balance + active_member + estimated_salary, data = churn_data_logit)
summary(churnglm4)
## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + active_member + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.306  -0.373  -0.181   0.454   0.992  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.86e-01   3.84e-02   -4.84  1.3e-06 ***
## credit_score     -1.82e-04   4.70e-05   -3.87  0.00011 ***
## country          -6.99e-03   5.55e-03   -1.26  0.20754    
## gender           -1.28e-02   9.22e-03   -1.38  0.16618    
## age               1.60e-02   4.31e-04   37.14  < 2e-16 ***
## tenure           -3.31e-03   1.57e-03   -2.11  0.03491 *  
## balance           1.11e-06   7.51e-08   14.73  < 2e-16 ***
## active_member     1.04e-02   9.19e-03    1.13  0.25681    
## estimated_salary  2.09e-07   8.00e-08    2.61  0.00902 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2458.9  on 9999  degrees of freedom
## Residual deviance: 2102.9  on 9991  degrees of freedom
## AIC: 12806
## 
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm4), title = "Confusion matrix from Logit Model-4" )
Confusion matrix from Logit Model-4
          Predicted 0  Predicted 1  Total
Actual 0         4624         1017   5641
Actual 1         1994         2365   4359
Total            6618         3382  10000
prob4=predict(churnglm4, type = "response" )
churn_data_logit$prob=prob4
h4 <- roc(churn~prob, data = churn_data_logit)
auc(h4) 
## Area under the curve: 0.745
plot(h4)

churnglm5 <- glm(churn ~ credit_score+country+gender+age+tenure+balance+estimated_salary, data = churn_data_logit)
summary(churnglm5)
## 
## Call:
## glm(formula = churn ~ credit_score + country + gender + age + 
##     tenure + balance + estimated_salary, data = churn_data_logit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.301  -0.372  -0.181   0.453   0.992  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.81e-01   3.82e-02   -4.75  2.1e-06 ***
## credit_score     -1.81e-04   4.70e-05   -3.86  0.00011 ***
## country          -6.95e-03   5.55e-03   -1.25  0.21023    
## gender           -1.30e-02   9.22e-03   -1.41  0.15830    
## age               1.60e-02   4.31e-04   37.16  < 2e-16 ***
## tenure           -3.28e-03   1.57e-03   -2.09  0.03664 *  
## balance           1.11e-06   7.51e-08   14.72  < 2e-16 ***
## estimated_salary  2.09e-07   8.00e-08    2.61  0.00900 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.21)
## 
##     Null deviance: 2458.9  on 9999  degrees of freedom
## Residual deviance: 2103.1  on 9992  degrees of freedom
## AIC: 12805
## 
## Number of Fisher Scoring iterations: 2
xkabledply( confusion_matrix(churnglm5), title = "Confusion matrix from Logit Model-5" )
Confusion matrix from Logit Model-5
          Predicted 0  Predicted 1  Total
Actual 0         4619         1022   5641
Actual 1         2002         2357   4359
Total            6621         3379  10000
prob5=predict(churnglm5, type = "response" )
churn_data_logit$prob=prob5
h5 <- roc(churn~prob, data = churn_data_logit)
auc(h5) 
## Area under the curve: 0.745
plot(h5)
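When comparing these candidate models, AIC from genuine logistic fits (`family = binomial`) is the usual yardstick. A self-contained sketch on simulated data, not the churn set, shows how the AIC table points to the data-generating model:

```r
# Simulated binary outcome: x1 and x2 matter, x3 is pure noise.
set.seed(7)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 + 0.8 * x2))

m1 <- glm(y ~ x1,           family = binomial)
m2 <- glm(y ~ x1 + x2,      family = binomial)
m3 <- glm(y ~ x1 + x2 + x3, family = binomial)

AIC(m1, m2, m3)  # m2, the true model, is expected to score best
```

Dropping a genuinely informative predictor (m1) costs far more AIC than carrying one noise variable (m3), which is why AIC tends to favor slightly larger models than BIC.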